A Sub-Character Architecture for Korean Language Processing
Abstract
We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters, from which character- and word-level representations are induced. The jamo letters expose syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.
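The decomposition step described in the abstract can be illustrated with a minimal sketch. It assumes the standard Unicode arithmetic for precomposed Hangul syllables (19 initial consonants, 21 vowels, 27 optional final consonants) and is not necessarily how the authors implement it; the function names `decompose` and `to_jamo` are hypothetical.

```python
# Illustrative sketch only: splits precomposed Hangul syllables into
# conjoining jamo using Unicode code point arithmetic. The helper names
# are assumptions made for this example, not taken from the paper.

S_BASE = 0xAC00   # first precomposed syllable, "가"
L_BASE = 0x1100   # first initial consonant (choseong)
V_BASE = 0x1161   # first vowel (jungseong)
T_BASE = 0x11A7   # one before the first final consonant (jongseong)
V_COUNT = 21      # number of vowels
T_COUNT = 28      # 27 final consonants plus "no final consonant"
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables


def decompose(ch):
    """Return the jamo of one Hangul syllable as a tuple.

    Characters outside the precomposed syllable block are returned
    unchanged as a one-element tuple.
    """
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        return (ch,)
    lead = chr(L_BASE + index // (V_COUNT * T_COUNT))
    vowel = chr(V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT)
    tail_index = index % T_COUNT
    if tail_index == 0:
        # Syllable has no final consonant.
        return (lead, vowel)
    return (lead, vowel, chr(T_BASE + tail_index))


def to_jamo(text):
    """Decompose every character of a string, one tuple per character."""
    return [decompose(ch) for ch in text]
```

For example, `decompose("한")` yields the three jamo U+1112, U+1161, and U+11AB, while `decompose("가")` yields only two because the syllable has no final consonant. A model built on such units sees a few dozen distinct symbols rather than thousands of distinct syllable characters.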
Similar resources
Dual Long Short-Term Memory Networks for Sub-Character Representation Learning
Characters have commonly been regarded as the minimal processing unit in Natural Language Processing (NLP). But many non-Latin languages have hieroglyphic writing systems, involving a big alphabet with thousands or millions of characters. Each character is composed of even smaller parts, which are often ignored by previous work. In this paper, we propose a novel architecture employing two s...
Online Recognition of Isolated Handwritten Persian Characters Based on Main Body Group Detection Using Support Vector Machine
In this paper a new method for the online recognition of handwritten Persian characters has been proposed which uses a set of simple features and Support Vector Machine (SVM) as a classifier. The task of preprocessing allows us to equalize feature vectors from different characters. This algorithm is implemented in two steps. In the first step, input character is classified into one of eighteen ...
The Machine Translation Researches and Governmental View in Korea
Viewed from a broad perspective, in the seventies when we studied the basic technologies of NLP as a groundwork of MT, the focus of research was given to describing various phenomena of the Korean language in a linguistically significant way and processing the Korean characters mathematically or specific phenomena of the language logically with a computer. The theoretical linguistic description...
A new model for Persian multi-part words edition based on statistical machine translation
Multi-part words in the English language are hyphenated, with the hyphen separating the different parts. The Persian language has multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead of a half-space character. This common incorrect use of space leads to some s...
SKOPE: A connectionist/symbolic architecture of spoken Korean processing
Spoken language processing requires speech and natural language integration. Moreover, spoken Korean calls for unique processing methodology due to its linguistic characteristics. This paper presents SKOPE, a connectionist/symbolic spoken Korean processing engine, which emphasizes that: 1) connectionist and symbolic techniques must be selectively applied according to their relative strength and...